Lapse risk modelling in insurance: a Bayesian mixture approach
Authors: Viviana G R Lobo, Thais C O Fonseca and Mariane B Alves
Annals of Actuarial Science
Manuscript ID AAS-2022-0045

This document presents a discussion descriptive statistical analysis for the Telco customer dataset shown in Section 3.3.

Load packages and the dataset

#### import telco customer data and libraries
source("up(telco).R")
## Loading required package: parsnip
## Loading required package: survival
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
## Loading required package: ggsurvfit
## Loading required package: censored
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ lubridate 1.9.2     ✔ tibble    3.2.1
## ✔ readr     2.1.4     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
## 
## Attaching package: 'plotly'
## 
## 
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## 
## 
## The following object is masked from 'package:stats':
## 
##     filter
## 
## 
## The following object is masked from 'package:graphics':
## 
##     layout
## 
## 
## 
## Attaching package: 'scales'
## 
## 
## The following object is masked from 'package:readr':
## 
##     col_factor
## 
## 
## The following object is masked from 'package:purrr':
## 
##     discard
## 
## 
## This is bayesplot version 1.10.0
## 
## - Online documentation and vignettes at mc-stan.org/bayesplot
## 
## - bayesplot theme set to bayesplot::theme_default()
## 
##    * Does _not_ affect other ggplot2 plots
## 
##    * See ?bayesplot_theme_set for details on theme setting
## 
## Loading required package: gridExtra
## 
## 
## Attaching package: 'gridExtra'
## 
## 
## The following object is masked from 'package:dplyr':
## 
##     combine

About Telco customer dataset

The Telco customer churn data is available in IBM Business Analytics Community Connect XXXX and contains information about a Telco company that provides home phone and internet services to 7,043 customers in California in the United States. The dataset contains 18 variables about the profile of the customers and a variable indicating who have left or stayed the service and presents an approximate 73.4% censorship rate. The response variable tenure represents the number of months the customer has stayed with the company with average of 32 months. Lifetimes equal zero were removed resulting in a dataset of 7,032 customers. Demographics characteristics are included for each customer, as well as customer account information and service information. Description of the variables as follows:

Demographic Information

Services Information

Customer Account Information

rmarkdown::paged_table(df)

Descriptive Analysis via Kaplan-Meier

We present a descriptive analysis based on empirical marginal survival curves with the Kaplan-Meier estimator and evaluate the hazard curve for some variables including the categories demographics, service information, and customer account information.

dados <-  df
#Kaplan-Meier fit (marginal)
fit.km1 <- survfit2(Surv(tenure, status) ~ gender, dados)
fit.km2 <- survfit2(Surv(tenure, status) ~ PaymentMethod, dados)
fit.km3 <- survfit2(Surv(tenure, status) ~ Contract, dados)
fit.km4 <- survfit2(Surv(tenure, status) ~ SeniorCitizen, dados)
fit.km5 <- survfit2(Surv(tenure, status) ~ InternetService, dados)
fit.km6 <- survfit2(Surv(tenure, status) ~ StreamingTV, dados)
fit.km8 <- survfit2(Surv(tenure, status) ~ Dependents, dados)
fit.km9 <- survfit2(Surv(tenure, status) ~ PaperlessBilling, dados)
fit.km10<- survfit2(Surv(tenure, status) ~ DeviceProtection, dados)
fit.km11<- survfit2(Surv(tenure, status) ~ PhoneService, dados)
fit.km12<- survfit2(Surv(tenure, status) ~ TechSupport, dados)

extrai_km <- function(fit_km) {
    ret <- tidy_survfit(fit_km, type = "surv") %>%
    select(.eval_time = time, .pred_survival = estimate, id = strata) %>%
      group_by(id) %>%
      mutate(haz = tx.emp(.eval_time, .pred_survival)) %>%  
      tidyr::nest(.pred = c(.eval_time, .pred_survival))
      return(ret)
}

kms <- list(Gender=fit.km1, PaymentMethod=fit.km2, Contract=fit.km3, 
            SeniorCitizen= fit.km4, InternetService= fit.km5, 
            StreamingTV= fit.km6,Dependents=fit.km8, PaperlessBilling= fit.km9, DeviceProtection=fit.km10, PhoneService= fit.km11, TechSupport=fit.km12) 
resultado= map(kms, extrai_km)

Gender

preds <-bind_rows(Gender=resultado$Gender, .id = "modelo") %>%
  group_by(modelo) %>%
  ungroup() %>%
  tidyr::unnest(cols = .pred)

a1<- ggplot(aes(x = .eval_time, y = .pred_survival, col=id), data = preds) +
  xlab("survival time (in months)") + ylab("survival probabilities")+
  geom_step(linewidth=1) + 
   theme_bw() +
  scale_y_continuous(limits=c(0.2,1), breaks = seq(0.2,1,by=.2), labels = point) +
  scale_x_continuous(limits=c(0,75), breaks = seq(0,75,by=20), labels = grad)

a2<-ggplot(aes(x = .eval_time, y = haz, col=id), data = preds) +
  xlab("survival time (in months)") + ylab("hazard rate")+
  theme_bw() +
  geom_line(linewidth=1) + 
   scale_y_continuous(limits=c(0,0.06), breaks = seq(0,0.06,by=.02), labels = point)+
  scale_x_continuous(limits=c(0,75), breaks = seq(0,75,by=20), labels = grad)

PaymentMethod

For the payment method see that the behaviour of survival curves and churn rates for Credit Card (automatic), Bank Transfer (automatic) and Mailed Check are quite similar. On the other hand, when we consider the instantaneous lapse curve, notice that customers that preferable for an {Eletronic check} as a payment method are the majority to leave the company. In this case, we consider jointly these classes in a new category called ``others’’.

preds <-   
  bind_rows(PaymentMethod=resultado$PaymentMethod, .id = "modelo") %>%
  group_by(modelo) %>%
  ungroup() %>%
  tidyr::unnest(cols = .pred)

b1<- ggplot(aes(x = .eval_time, y = .pred_survival, col=id), data = preds) +
  xlab("survival time (in months)") + ylab("survival probabilities")+
  geom_step(linewidth=1) + 
   theme_bw() +
  scale_y_continuous(limits=c(0.2,1), breaks = seq(0.2,1,by=.2), labels = point) +
  scale_x_continuous(limits=c(0,75), breaks = seq(0,75,by=20), labels = grad)

b2<-ggplot(aes(x = .eval_time, y = haz, col=id), data = preds) +
  xlab("survival time (in months)") + ylab("hazard rate")+
  theme_bw() +
  geom_line(linewidth=1) + 
   scale_y_continuous(limits=c(0,0.06), breaks = seq(0,0.06,by=.02), labels = point)+
  scale_x_continuous(limits=c(0,75), breaks = seq(0,75,by=20), labels = grad)

Contract

For the customer account information, see that customers with Month-to-Month contracts have higher lapse rates compared to clients with One-year and Two-year contracts. See that the hazard curve are very similar for the contract that has payment more than one year. In this way, we consider to aggregate in a unique group called {``One-year +’’}.

preds <-   
 bind_rows(Contract=resultado$Contract, .id = "modelo") %>%
  group_by(modelo) %>%
  ungroup() %>%
  tidyr::unnest(cols = .pred)

c1<- ggplot(aes(x = .eval_time, y = .pred_survival, col=id), data = preds) +
  xlab("survival time (in months)") + ylab("survival probabilities")+
  geom_step(linewidth=1) + 
   theme_bw() +
  scale_y_continuous(limits=c(0.2,1), breaks = seq(0.2,1,by=.2), labels = point) +
  scale_x_continuous(limits=c(0,75), breaks = seq(0,75,by=20), labels = grad)

c2<-ggplot(aes(x = .eval_time, y = haz, col=id), data = preds) +
  xlab("survival time (in months)") + ylab("hazard rate")+
  theme_bw() +
  geom_line(linewidth=1) + 
   scale_y_continuous(limits=c(0,0.06), breaks = seq(0,0.06,by=.02), labels = point)+
  scale_x_continuous(limits=c(0,75), breaks = seq(0,75,by=20), labels = grad)

SeniorCitizen

preds <-   
 bind_rows(SeniorCitizen=resultado$SeniorCitizen, .id = "modelo") %>%
 group_by(modelo) %>%
  ungroup() %>%
  tidyr::unnest(cols = .pred)

d1<- ggplot(aes(x = .eval_time, y = .pred_survival, col=id), data = preds) +
  xlab("survival time (in months)") + ylab("survival probabilities")+
  geom_step(linewidth=1) + 
   theme_bw() +
  scale_y_continuous(limits=c(0.2,1), breaks = seq(0.2,1,by=.2), labels = point) +
  scale_x_continuous(limits=c(0,75), breaks = seq(0,75,by=20), labels = grad)

d2<-ggplot(aes(x = .eval_time, y = haz, col=id), data = preds) +
  xlab("survival time (in months)") + ylab("hazard rate")+
  theme_bw() +
  geom_line(linewidth=1) + 
   scale_y_continuous(limits=c(0,0.06), breaks = seq(0,0.06,by=.02), labels = point)+
  scale_x_continuous(limits=c(0,75), breaks = seq(0,75,by=20), labels = grad)

InternetService

preds <-   
bind_rows(InternetService=resultado$InternetService, .id = "modelo") %>%
  group_by(modelo) %>%
  ungroup() %>%
  tidyr::unnest(cols = .pred)

e1<- ggplot(aes(x = .eval_time, y = .pred_survival, col=id), data = preds) +
  xlab("survival time (in months)") + ylab("survival probabilities")+
  geom_step(linewidth=1) + 
   theme_bw() +
  scale_y_continuous(limits=c(0.2,1), breaks = seq(0.2,1,by=.2), labels = point) +
  scale_x_continuous(limits=c(0,75), breaks = seq(0,75,by=20), labels = grad)

e2<-ggplot(aes(x = .eval_time, y = haz, col=id), data = preds) +
  xlab("survival time (in months)") + ylab("hazard rate")+
  theme_bw() +
  geom_line(linewidth=1) + 
   scale_y_continuous(limits=c(0,0.06), breaks = seq(0,0.06,by=.02), labels = point)+
  scale_x_continuous(limits=c(0,75), breaks = seq(0,75,by=20), labels = grad)

StreamingTV

preds <-   
 bind_rows(StreamingTV=resultado$StreamingTV, .id = "modelo") %>%
  group_by(modelo) %>%
  ungroup() %>%
  tidyr::unnest(cols = .pred)


f1<- ggplot(aes(x = .eval_time, y = .pred_survival, col=id), data = preds) +
  xlab("survival time (in months)") + ylab("survival probabilities")+
  geom_step(linewidth=1) + 
   theme_bw() +
  scale_y_continuous(limits=c(0.2,1), breaks = seq(0.2,1,by=.2), labels = point) +
  scale_x_continuous(limits=c(0,75), breaks = seq(0,75,by=20), labels = grad)

f2<-ggplot(aes(x = .eval_time, y = haz, col=id), data = preds) +
  xlab("survival time (in months)") + ylab("hazard rate")+
  theme_bw() +
  geom_line(linewidth=1) + 
   scale_y_continuous(limits=c(0,0.06), breaks = seq(0,0.06,by=.02), labels = point)+
  scale_x_continuous(limits=c(0,75), breaks = seq(0,75,by=20), labels = grad)

PaperlessBilling

preds <-   
bind_rows(PaperlessBilling=resultado$PaperlessBilling, .id = "modelo") %>%
  group_by(modelo) %>%
  ungroup() %>%
  tidyr::unnest(cols = .pred)



h1<- ggplot(aes(x = .eval_time, y = .pred_survival, col=id), data = preds) +
  xlab("survival time (in months)") + ylab("survival probabilities")+
  geom_step(linewidth=1) + 
   theme_bw() +
  scale_y_continuous(limits=c(0.2,1), breaks = seq(0.2,1,by=.2), labels = point) +
  scale_x_continuous(limits=c(0,75), breaks = seq(0,75,by=20), labels = grad)

h2<-ggplot(aes(x = .eval_time, y = haz, col=id), data = preds) +
  xlab("survival time (in months)") + ylab("hazard rate")+
  theme_bw() +
  geom_line(linewidth=1) + 
   scale_y_continuous(limits=c(0,0.06), breaks = seq(0,0.06,by=.02), labels = point)+
  scale_x_continuous(limits=c(0,75), breaks = seq(0,75,by=20), labels = grad)

DeviceProtection

preds <-   
 bind_rows(DeviceProtection=resultado$DeviceProtection, .id = "modelo") %>%
 group_by(modelo) %>%
  ungroup() %>%
  tidyr::unnest(cols = .pred)


i1<- ggplot(aes(x = .eval_time, y = .pred_survival, col=id), data = preds) +
  xlab("survival time (in months)") + ylab("survival probabilities")+
  geom_step(linewidth=1) + 
   theme_bw() +
  scale_y_continuous(limits=c(0.2,1), breaks = seq(0.2,1,by=.2), labels = point) +
  scale_x_continuous(limits=c(0,75), breaks = seq(0,75,by=20), labels = grad)

i2<-ggplot(aes(x = .eval_time, y = haz, col=id), data = preds) +
  xlab("survival time (in months)") + ylab("hazard rate")+
  theme_bw() +
  geom_line(linewidth=1) + 
   scale_y_continuous(limits=c(0,0.06), breaks = seq(0,0.06,by=.02), labels = point)+
  scale_x_continuous(limits=c(0,75), breaks = seq(0,75,by=20), labels = grad)

PhoneService

preds <-   
 bind_rows(PhoneService=resultado$PhoneService, .id = "modelo") %>%
 group_by(modelo) %>%
  ungroup() %>%
  tidyr::unnest(cols = .pred)


j1<- ggplot(aes(x = .eval_time, y = .pred_survival, col=id), data = preds) +
  xlab("survival time (in months)") + ylab("survival probabilities")+
  geom_step(linewidth=1) + 
   theme_bw() +
  scale_y_continuous(limits=c(0.2,1), breaks = seq(0.2,1,by=.2), labels = point) +
  scale_x_continuous(limits=c(0,75), breaks = seq(0,75,by=20), labels = grad)

j2<-ggplot(aes(x = .eval_time, y = haz, col=id), data = preds) +
  xlab("survival time (in months)") + ylab("hazard rate")+
  theme_bw() +
  geom_line(linewidth=1) + 
   scale_y_continuous(limits=c(0,0.06), breaks = seq(0,0.06,by=.02), labels = point)+
  scale_x_continuous(limits=c(0,75), breaks = seq(0,75,by=20), labels = grad)

The lapse rate for the two categories of Gender in the data are nearly the same. Similar behaviour is observed for some customer service information such as PhoneService, MutipleLines and OnlineSecurity. Also, there is multicollinearity for groups of variables, in particular OnlineSecurity, TechSupport, OnlineBackup and DeviceProtection are all dependent on the OnlineService variable. Also, the covariates MonthlyCharges and TotalCharges present high collinearity.

sp1= ggplot(aes(x=MonthlyCharges, y=TotalCharges), data=df) +
  theme_bw() +
  geom_point()
ggplotly(sp1)